Take Home Exercise 3

Author

Teng Kok Wai (Walter)

Published

October 29, 2024

Modified

November 10, 2024

1 Assignment Task

Refer to: Take-home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods – ISSS626 Geospatial Analytics and Applications

2 Overview

HDB flats play a central role in Singapore’s housing landscape, providing homes for the majority of its population. According to the Census of Population 2020, 87.9% of resident households own their homes, and 78.7% of households live in HDB flats. This prompts an essential question for every resident: What drives HDB resale prices? Could it be the flat’s location and proximity to essential services, or could it be more influenced by factors like flat type and size?

3 Objective

This project is especially meaningful to me as an aspiring homeowner interested in owning a 5-room HDB flat. My goal is to apply what I’ve learned in this module by sourcing, curating, and cleaning both geospatial and aspatial data from various sources, including data.gov.sg, OneMap, LTA Data Mall, and HDB, to build a comprehensive dataset for predictive modeling.

Using resale data for 5-room HDB flats from January to December 2023, I will train and compare three models: an OLS Multiple Linear Regression Model, a Random Forest and a Geographically Weighted Random Forest (gwRF) Model. These models will be used to predict 5-room HDB resale prices in Singapore for July to September 2024.

4 Installing and Launching the R Packages

The following R packages will be used in this exercise:

Package Purpose Use Case in Exercise
dotenv Manages environment variables Stores and retrieves API tokens securely for OneMap API access.
sf Handles spatial data; imports, manages, and processes geospatial data Manages geospatial data, such as Singapore’s boundary data and geocoded resale data.
onemapsgapi Accesses OneMap API for geospatial data Retrieves geospatial data like community clubs, libraries, and parks for analysis.
httr Facilitates HTTP requests and API interactions Sends API requests to OneMap to retrieve geospatial data for specified themes.
tidyverse Provides tools for data manipulation and visualization Cleans and processes datasets, such as merging geospatial features and handling missing values.
xml2 Parses and processes XML data Reads KML files for features like hawker centers and supermarkets.
rvest Scrapes web data Extracts data from Wikipedia for shopping mall locations.
jsonlite Processes JSON data Reads JSON files, such as primary school data retrieved from OneMap’s undocumented API.
units Handles unit conversions Converts and manages units when calculating proximity distances in meters.
matrixStats Performs fast matrix operations Efficiently calculates proximity distances in bulk, like computing minimum distances to amenities.
tmap Provides tools for thematic maps Visualizes spatial distributions of HDB resale prices and proximity features across Singapore.
ggplot2 Creates static graphics Plots distributions of resale prices and feature relationships.
plotly Creates interactive plots Visualizes variable importance and residuals in interactive bar and scatter plots.
ggstatsplot Enhances ggplot2 with statistical analysis features Creates correlation matrices for multicollinearity checks.
ggpubr Simplifies publication-ready visualizations Arranges multiple ggplot2-based plots for side-by-side comparison.
olsrr Provides tools for regression model diagnostics Performs diagnostic checks on the multiple linear regression model, including residual analysis.
knitr Formats data into tables for reporting Generates well-formatted tables to present RMSE and MAE scores of models.
kableExtra Enhances knitr tables Styles tables for readability and professional presentation.
spdep Analyzes spatial dependence Conducts Moran’s I Test to evaluate spatial autocorrelation in model residuals.
sfdep Provides spatial statistics with sf objects Generates distance-based weights for spatial autocorrelation tests, like Moran’s I.
ranger Builds fast, memory-efficient random forests Constructs a Random Forest model for predicting HDB resale prices.
GWmodel Provides geographically weighted regression models Fits a Geographically Weighted Random Forest (gwRF) model for spatially varying predictions.
SpatialML Supports machine learning on spatial data Utilizes spatial machine learning techniques for geographically weighted modeling.
Metrics Calculates common evaluation metrics Computes RMSE and MAE to evaluate model accuracy on the test set.
caret Provides tools for model training and tuning Supports model training and feature importance extraction in regression and machine learning contexts.

Compared to previous Take-Home exercises, the data acquisition process for this exercise is quite involved: it requires us to identify the relevant data, locate reliable sources, verify the data’s suitability for our use case, and perform thorough sanity checks.

Note

During this process, I stumbled upon Megan’s work which provided valuable guidance. However, the process was not without challenges. Data sources on official government portals had changed, and the jolene-lim/onemapsgapi library was no longer functional. To address this, I forked the library, made adjustments to handle the latest OneMap API changes, and used this modified version, davzoku/onemapsgapi, for my data acquisition process.

To install the forked development version of onemapsgapi, we can use devtools to install it directly from the GitHub source:

library(devtools)
devtools::install_github("davzoku/onemapsgapi")

For other libraries, we can use pacman as per usual:

pacman::p_load(dotenv, sf, onemapsgapi, httr, tidyverse, xml2, rvest, jsonlite, units, matrixStats, tmap, ggplot2, plotly, ggstatsplot, ggpubr, olsrr, knitr, kableExtra, spdep, sfdep, ranger, GWmodel, SpatialML, Metrics, caret)

5 Model Factors

To build the model, we will examine the following factors as recommended in the assignment task. Note that after studying relevant materials such as HDB Annual Reports and reviewing the availability of data, I have included additional factors on top of the recommended list. These additional factors are marked with (NEW).

5.1 Structural Factors

  • Area of the unit
  • Floor level
  • Remaining lease
  • Age of the unit
  • Main Upgrading Program (MUP), Home Improvement Program (HIP) completed

5.2 Locational Factors

5.2.1 Accessibility

  • Proximity to CBD
  • Proximity to MRT
  • Number of bus stops within 350m

5.2.2 Recreational and Lifestyle

  • Proximity to park
  • Proximity to food courts/hawker centres
  • Proximity to shopping malls
  • Proximity to supermarkets
  • Proximity to SportsSG Sports Centre (NEW)
    • Accessibility to sports facilities may be important to residents who lead an active lifestyle, making flats closer to sports centers more attractive and potentially boosting their resale prices.
  • Proximity to Community Club (NEW)
    • Community clubs offer recreational, educational, and social services, enhancing the quality of life for nearby residents. This added convenience can positively impact resale prices as buyers may value proximity to community amenities.
  • Proximity to National Library (NEW)
    • Libraries provide educational resources and spaces for community activities. Being close to a national library can be an attractive feature for families and individuals who value learning and community engagement, potentially increasing the appeal of nearby flats.

5.2.3 Educational and Family Services

  • Proximity to good primary schools
  • Number of kindergartens within 350m
  • Number of childcare centres within 350m
  • Number of primary schools within 1km

5.2.4 Healthcare and Eldercare

  • Proximity to eldercare facilities
  • No. of CHAS clinics within 350m (NEW)
    • CHAS (Community Health Assist Scheme) clinics provide subsidized healthcare services, making them important for families and elderly residents. Proximity to these clinics can increase a flat’s desirability for those who prioritize affordable and accessible healthcare, potentially influencing resale prices.

6 Data Sources


Dataset Source Description
Master Plan 2019 Subzone Boundary data.gov.sg Geospatial boundaries for Singapore’s planning subzones in 2019.
Pre-Schools Location data.gov.sg Location data for pre-schools in Singapore.
HDB Resale Flat Prices data.gov.sg Transaction data for resale prices of HDB flats.
MRT and LRT Station Exits LTA DataMall Geolocation of MRT and LRT station exits.
Bus Stops LTA DataMall Geographic location of bus stops across Singapore.
Hawker Centres data.gov.sg Location data for hawker centers, contributing to lifestyle and accessibility factors.
Supermarkets data.gov.sg Locations of supermarkets in Singapore.
CHAS Clinics data.gov.sg Locations of CHAS clinics offering subsidized healthcare.
Community Clubs OneMap Location data of community clubs.
Libraries OneMap Locations of national and regional libraries.
SportsSG Sports Centres OneMap Locations of SportsSG facilities.
Eldercare Facilities OneMap Geolocation of eldercare facilities, important for assessing accessibility to senior services.
Primary Schools OneMap Locations of primary schools to analyze proximity to educational institutions.
Parks OneMap Geospatial data for parks in Singapore.
Shopping Malls Wikipedia Manually collected data on shopping malls in Singapore.
HDB MUP and HIP Programs HDB Data on HDB’s Main Upgrading Program (MUP) and Home Improvement Program (HIP), relevant for resale value analysis due to upgrades.
Top 20 Primary Schools Various sources List of top primary schools in Singapore
National Parks OneMap Location data of national parks.

7 Data Collection and Geocoding

This section outlines how I collect datasets through APIs or manual methods, and how aspatial datasets are converted into geospatial datasets via geocoding.

7.1 Geocoding (resale)

For the data acquisition process, I’ll start by retrieving geospatial information for the relevant resale flats. This involves using the OneMap API to fetch latitude and longitude coordinates based on specified addresses.

Referencing code from In-Class Exercise 10, we’ll filter for 5 ROOM resale transactions from Jan 2023 to Dec 2023 (training) and Jul 2024 to Sep 2024 (prediction). This reduces the number of API calls needed to OneMap by retrieving only the data we need.

resale <- read_csv("data/raw_data/resale.csv") %>%
  filter(
    (month >= "2023-01" & month <= "2023-12") | 
    (month >= "2024-07" & month <= "2024-09")
  ) %>%
  filter(flat_type == "5 ROOM")
# sanity check that filter is as intended
distinct_months <- resale %>%
  distinct(month) %>%
  arrange(month)

distinct_months

Next, we will tidy the data by creating new columns:

  • address: Combines block and street_name to form a complete address.
  • remaining_lease_yr: Extracts the remaining lease years as an integer.
  • remaining_lease_mth: Extracts the remaining lease months as an integer.

resale_tidy <- resale %>%
  mutate(address = paste(block,street_name)) %>%
  mutate(remaining_lease_yr = as.integer(
    str_sub(remaining_lease, 0, 2)))%>%
  mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))

Next, we generate a sorted list of unique addresses from the filtered dataset, which will be used to retrieve geographical coordinates:

add_list <- sort(unique(resale_tidy$address))
length(add_list)

We are expected to make 3329 calls to the OneMap API for geocoding. To streamline the process, we will use a helper function, get_coords(), to perform the geocoding.

Code
get_coords <- function(add_list) {
  url <- "https://onemap.gov.sg/api/common/elastic/search"
  found <- data.frame()
  not_found <- data.frame()
  
  for (i in seq_along(add_list)) {
    address <- add_list[i]
    # verbose debug
    # print(i)
    
    query <- list('searchVal' = address, 'returnGeom' = 'Y', 
                  'getAddrDetails' = 'Y', 'pageNum' = '1')
    res <- GET(url, query = query)
    
    if (content(res)$found != 0) {
      # Retrieve and process the geospatial data
      tmp_df <- data.frame(content(res))[4:13]
      tmp_df$address <- address  # Add address to the found data
      found <- rbind(found, tmp_df)
    } else {
      # Add to not_found if no data is found
      not_found <- rbind(not_found, data.frame(address = address))
    }
  }
  
  # Return a list containing both found and not_found dataframes
  list(found = found, not_found = not_found)
}

The output of get_coords is a list of found and not_found data frames. Successful API calls, where coordinates are retrieved, are stored in found, while unsuccessful calls (where no data was returned) are recorded in not_found.

coord_results <- get_coords(add_list)
found <- coord_results[["found"]]
Tip

Remember to review the not_found dataframe to catch any API failures or missing data.

We will save the found data frame for future reference if needed:

write_rds(found, "data/rds/found.rds")
found <- read_rds("data/rds/found.rds")

Next, we will tidy the columns into a simpler format for easy referencing.

Code
found_tidy <- found %>%
  select(results.BLK_NO, results.ROAD_NAME, results.POSTAL, results.X, results.Y, address) %>%
  rename(
    POSTAL = results.POSTAL,
    XCOORD = results.X,
    YCOORD = results.Y,
    BLK_NO = results.BLK_NO,
    ROAD_NAME = results.ROAD_NAME
  )

Using the address fields, we can join resale_tidy with found_tidy.

resale_geocoded = left_join(
  resale_tidy, found_tidy, 
  by = c('address' = 'address'))
# sanity check that all postal code is populated
sum(is.na(resale_geocoded$POSTAL))
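A caveat worth checking as well (this is an assumption based on a commonly reported OneMap quirk, so verify against your own results): the API can return the literal string "NIL" instead of NA for missing postal codes, which the is.na() check above will not catch:

```r
# "NIL" postal codes slip past is.na(); count them separately
sum(resale_geocoded$POSTAL == "NIL", na.rm = TRUE)
```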

Next, we will convert resale_geocoded tibble dataframe to sf:

resale_geocoded_sf <- st_as_sf(resale_geocoded, 
                            coords = c("XCOORD",
                                       "YCOORD"),
                            crs=3414)

Then, we check for overlapping point features:

overlapping_points <- resale_geocoded_sf %>%
  mutate(overlap = lengths(st_equals(., .)) > 1) %>%
  filter(overlap)  # keep only points that coincide with at least one other point

nrow(overlapping_points)

There are 7679 overlapping points. We will use st_jitter() of the sf package to displace each point by up to 5m to avoid overlapping point features.

resale_geocoded_sf_jit <- resale_geocoded_sf %>%
  st_jitter(amount = 5)
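Before trusting the jittered output, we can sanity-check the approach on a toy example (illustrative coordinates only, in the same EPSG:3414 CRS): two coincident points become distinct after st_jitter().

```r
library(sf)

# Toy illustration: two coincident points and one distinct point
pts <- st_as_sf(
  data.frame(id = 1:3, x = c(0, 0, 10), y = c(0, 0, 10)),
  coords = c("x", "y"), crs = 3414
)

# count points that coincide with at least one other point
sum(lengths(st_equals(pts, pts)) > 1)   # 2

# after jittering by up to 5m, the duplicates should separate
pts_jit <- st_jitter(pts, amount = 5)
sum(lengths(st_equals(pts_jit, pts_jit)) > 1)   # expect 0
```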

8 Self-Sourcing & More Geocoding

Some of the factors listed above are not readily available via API calls or government portals and require manual data sourcing.

8.1 Shopping Malls

For shopping mall data, I referred to Wikipedia and manually converted it into a CSV file. While the Wikipedia page was largely accurate, I corrected minor discrepancies based on my knowledge.

shopping_malls_raw <- read_csv("data/raw_data/shopping_malls.csv")

Since this dataset was manually curated, I performed a sanity check for duplicates:

duplicates_mall_name <- shopping_malls_raw %>%
  filter(duplicated(mall_name))
duplicates_mall_name 

Next, I created a unique, sorted list of mall names for geocoding:

addr_list_malls <- sort(unique(shopping_malls_raw$mall_name))

Then, we will reuse the get_coords() function to geocode our malls dataset:

coord_results_malls <- get_coords(addr_list_malls)
found_malls <- coord_results_malls[["found"]] %>%
  rename(x = results.X, y = results.Y) %>%
  select(address, x, y)
write_rds(found_malls, "data/rds/found_malls.rds")
Tip

Remember to review the not_found dataframe to catch any API failures or missing data.

8.2 HDB MUP, HIP Program

For HDB’s MUP and HIP program data, I used HDB’s Upgrading/Estate Renewal Programmes Webservice. This data was manually compiled, and only completed HIP and MUP projects were included. Additionally, only HDB blocks that appear in resale_geocoded_sf_jit (the resale blocks of interest) were considered.

hdb_prog <- read_csv("data/raw_data/hdb_mup_hip.csv") %>%
  mutate(address = paste(block,street_name))

Since the dataset is manually curated, let’s check for duplicates as sanity check.

duplicate_hdb_prog <- hdb_prog %>%
  filter(duplicated(address))

duplicate_hdb_prog

Since hdb_prog is a subset of resale_geocoded_sf_jit, I reused found_tidy from earlier, avoiding additional API calls.

hdb_prog_geocoded <- hdb_prog %>%
  left_join(found_tidy %>% select(address, XCOORD, YCOORD), by = "address") %>%
  rename(x = XCOORD, y = YCOORD)

To confirm completeness, I checked for any missing coordinates:

filter(hdb_prog_geocoded, is.na(x))

Finally, I saved the geocoded HDB program data:

write_rds(hdb_prog_geocoded, "data/rds/hdb_prog_geocoded.rds")

8.3 Data Acquisition from OneMap

In this section, we’ll use the davzoku/onemapsgapi package, which interfaces with the OneMap API. We will mainly use get_theme API to retrieve geospatial data for specific locational factors.

(Optional) To explore available themes, you can use search_themes(token) to select themes relevant to your study. In the code below, I’ve pre-selected themes that appear relevant based on their theme names.

Note that theme names may change over time, so verification is recommended.

Warning

OneMap API requires an API token. Do not commit or share this token publicly.

Use a .env file and include it in .gitignore. For example, I have a .env.example file. Copy .env.example to .env and paste in your API token.
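For reference, the .env file is a plain key-value text file; a minimal example with a placeholder value (replace it with your own token from the OneMap developer portal) might look like:

```
TOKEN=your_onemap_api_token
```

load_dot_env(file=".env") then makes this value available via Sys.getenv("TOKEN").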

Code
load_dot_env(file=".env")
token <- Sys.getenv("TOKEN")

# get list of available themes
available_themes <- search_themes(token)

# cc
communityclubs_tbl <- get_theme(token, "communityclubs")
communityclubs_sf <- st_as_sf(communityclubs_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(communityclubs_sf, "data/rds/communityclubs_sf.rds")

# lib
libraries_tbl <- get_theme(token, "libraries")
libraries_sf <- st_as_sf(libraries_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(libraries_sf, "data/rds/libraries_sf.rds")

# sports
sportsg_tbl <- get_theme(token, "ssc_sports_facilities")
sportsg_sf <- st_as_sf(sportsg_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(sportsg_sf, "data/rds/sportsg_sf.rds")

# elder
eldercare_tbl <- get_theme(token, "eldercare")
eldercare_sf <- st_as_sf(eldercare_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(eldercare_sf, "data/rds/eldercare_sf.rds")

# child
childcare_tbl <- get_theme(token, "childcare")
childcare_sf <- st_as_sf(childcare_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(childcare_sf, "data/rds/childcare_sf.rds")

# kinder
kindergarten_tbl <- get_theme(token, "kindergartens")
kindergartens_sf <- st_as_sf(kindergarten_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(kindergartens_sf, "data/rds/kindergartens_sf.rds")

# park
natparks_tbl <- get_theme(token, "nationalparks")
natparks_sf <- st_as_sf(natparks_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(natparks_sf, "data/rds/natparks_sf.rds")

From the OneMap API, we were able to retrieve the following data:

  • community clubs
  • library
  • sports sg sports complex
  • eldercare facilities
  • childcare facilities
  • kindergartens
  • parks

8.3.1 Data Acquisition from OneMap (in the wild)

Acquiring primary school data requires going beyond the standard OneMap API. Although OneMap provides a web-based mapping application for public visualization of geospatial data, extracting the underlying data takes a few extra steps.

By using browser developer tools to examine network activity, we can identify the data source for primary schools as https://www.onemap.gov.sg/omapp/getAllPriSchools. This endpoint is undocumented in the OneMap API, suggesting it might be managed by a separate team or system. For comparison, OneMap’s documented APIs follow this format: https://www.onemap.gov.sg/api/public/themesvc/checkThemeStatus

See the image below for details.
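Assuming the undocumented endpoint remains publicly accessible (its behavior may change without notice), the payload can be fetched with httr and cached locally so that downstream steps do not depend on the live endpoint:

```r
library(httr)

# Hedged sketch: fetch the undocumented primary-school endpoint and cache
# the raw JSON payload locally for the parsing step that follows
res <- GET("https://www.onemap.gov.sg/omapp/getAllPriSchools")
stop_for_status(res)
writeLines(content(res, as = "text", encoding = "UTF-8"),
           "data/raw_data/getAllPriSchools_payload.json")
```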

After inspecting the JSON payload, I found duplicate columns and a row of null values at the start. To clean the data, I removed these unnecessary columns and dropped the null row to retain only the essential information:

pri_sch_tbl <- fromJSON("data/raw_data/getAllPriSchools_payload.json")[["SearchResults"]] %>%
  select(-PageCount, -HSE_BLK_NUM, -SCH_POSTAL_CODE, -SCH_ROAD_NAME, -HYPERLINK, -MOREINFO, -LATITUDE, -LONGITUDE, -GEOMETRY) %>% # remove dup columns
  slice(-1) # remove the first row of null values
glimpse(pri_sch_tbl)

Next, I converted the cleaned data into an sf data frame using the X and Y coordinates, setting the coordinate reference system to SVY21 (EPSG:3414). This geospatial format enables easier spatial analysis and visualization.

pri_sch_sf_3414 <- st_as_sf(pri_sch_tbl, coords=c("SCH_X_ADDR", "SCH_Y_ADDR"), crs=3414)
write_rds(pri_sch_sf_3414, "data/rds/pri_sch_sf_3414.rds")
8.3.1.1 Good Primary Schools

Numerous articles and studies highlight the impact of proximity to top primary schools on HDB prices. For example, Living Near A Popular Primary School: The Data On HDB Prices Within 1KM May Surprise You explores this topic.

After reviewing multiple top primary school lists from sources like Creative Campus, Joyous Learning, and Math Nuggets, I compiled a top 20 list:

Code
top_20_primary_schools <- c(
  "Methodist Girls' School (Primary)",
  "Tao Nan School",
  "Ai Tong School",
  "Holy Innocents' Primary School",
  "CHIJ St. Nicholas Girls' School",
  "Admiralty Primary School",
  "St. Joseph's Institution Junior",
  "Catholic High School",
  "Anglo-Chinese School (Junior)",
  "Chongfu School",
  "Kong Hwa School",
  "St. Hilda's Primary School",
  "Anglo-Chinese School (Primary)",
  "Nan Chiau Primary School",
  "Nan Hua Primary School",
  "Nanyang Primary School",
  "Pei Hwa Presbyterian Primary School",
  "Kuo Chuan Presbyterian Primary School",
  "Rulang Primary School",
  "Singapore Chinese Girls' Primary School"
)

After converting these school names to uppercase for case consistency, I filtered pri_sch_sf_3414 to retain only the top 20 schools and saved this filtered dataset:

top_20_primary_schools <- toupper(top_20_primary_schools)
# Filter the pri_sch_sf_3414 dataframe for the top 20 schools
pri_sch_sf_3414_top_20 <- pri_sch_sf_3414 %>%
  filter(SCHOOLNAME %in% top_20_primary_schools)

write_rds(pri_sch_sf_3414_top_20, "data/rds/pri_sch_sf_3414_top_20.rds")

9 Import Geospatial Data & Pre-processing

In this section, we will import the geospatial datasets that we obtained directly from government portals and those prepared in earlier sections.

Whenever possible, we will import geospatial datasets and transform them to the SVY21 coordinate reference system (CRS), represented by EPSG code 3414. This ensures that all datasets maintain a consistent CRS, facilitating accurate spatial analysis and alignment across layers.

9.1 Import Data

Master Plan Subzone 2019

mpsz_sf <- st_read(dsn = "data/raw_data", layer = "MPSZ-2019")  %>%
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `/Users/walter/code/isss626/isss626-gaa/Take-home_Ex/Take-home_Ex03/data/raw_data' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

MRT & LRT

train_sf <- st_read(dsn = "data/raw_data", layer = "Train_Station_Exit_Layer") %>%
  st_transform(crs = 3414)

Bus Stop

bus_sf <- st_read(dsn = "data/raw_data", layer = "BusStop") %>%
  st_transform(crs = 3414)

Hawker Centres

hawker_sf <- st_read("data/raw_data/HawkerCentresKML.kml") %>%
  st_transform(crs = 3414)

Supermarkets

supermarket_sf <- st_read("data/raw_data/SupermarketsKML.kml") %>%
  st_transform(crs = 3414)

CHAS Clinics

CHAS_sf <- st_read("data/raw_data/CHASClinics.kml") %>%
  st_transform(crs = 3414)

Resale

resale_geocoded_sf_jit_sf <- read_rds("data/rds/resale_geocoded_sf_jit.rds")
resale_geocoded_sf_train_jit_sf <- read_rds("data/rds/resale_geocoded_sf_train_jit.rds")
resale_geocoded_sf_test_jit_sf <- read_rds("data/rds/resale_geocoded_sf_test_jit.rds")

Shopping Malls

malls<- read_rds("data/rds/found_malls.rds")
malls_sf <- st_as_sf(malls, coords = c("x", "y"), crs = 3414)

HDB Upgrade Programs (MUP, HIP)

hdb_upgrade <- read_rds("data/rds/hdb_prog_geocoded.rds")
hdb_upgrade_sf <- st_as_sf(hdb_upgrade, coords = c("x", "y"), crs = 3414)

Primary Schools & Good Primary Schools

sch_sf <- read_rds("data/rds/pri_sch_sf_3414.rds")
good_sch_sf <- read_rds("data/rds/pri_sch_sf_3414_top_20.rds")

Community Clubs

communityclubs_sf <- read_rds("data/rds/communityclubs_sf.rds")
communityclubs_sf <- st_transform(communityclubs_sf, crs=3414)

Libraries

libraries_sf <- read_rds("data/rds/libraries_sf.rds")
libraries_sf <- st_transform(libraries_sf, crs=3414)

Sports Complex

sportsg_sf <- read_rds("data/rds/sportsg_sf.rds")
sportsg_sf <- st_transform(sportsg_sf, crs=3414)

Eldercare

eldercare_sf <- read_rds("data/rds/eldercare_sf.rds")
eldercare_sf <- st_transform(eldercare_sf, crs=3414)

Kindergarten

kindergarten_sf <- read_rds("data/rds/kindergartens_sf.rds")
kindergarten_sf <- st_transform(kindergarten_sf, crs=3414)

Childcare

childcare_sf <- read_rds("data/rds/childcare_sf.rds")
childcare_sf <- st_transform(childcare_sf, crs=3414)

Parks

natparks_sf <- read_rds("data/rds/natparks_sf.rds")
natparks_sf <- st_transform(natparks_sf, crs=3414)

Then, let’s make a list to manage these datasets for subsequent steps:

loc_fac_sfs <- list(
  resale_geocoded_jit = resale_geocoded_sf_jit_sf,
  malls = malls_sf,
  hdb_upgrade = hdb_upgrade_sf,
  sch = sch_sf,
  good_sch = good_sch_sf,
  communityclubs = communityclubs_sf,
  libraries = libraries_sf,
  sportsg = sportsg_sf,
  eldercare = eldercare_sf,
  kindergarten = kindergarten_sf,
  childcare = childcare_sf,
  natparks = natparks_sf,
  train = train_sf,
  bus = bus_sf,
  hawker = hawker_sf,
  supermarket = supermarket_sf,
  CHAS = CHAS_sf
)

9.2 Data Pre-processing

For each dataset, we will perform the following checks to ensure consistency and accuracy in our geospatial analysis:

  • Confirm that dimensions are in XY format only and remove any Z-dimensions if present using st_zm().
  • Check for invalid geometries and correct them.

During the data import process, we observed that a few datasets contained Z-dimensions, which we removed for consistency.

hawker_sf <- st_zm(hawker_sf, drop=TRUE, what = "ZM")
supermarket_sf <- st_zm(supermarket_sf, drop=TRUE, what = "ZM")
CHAS_sf <- st_zm(CHAS_sf, drop=TRUE, what = "ZM")

From our lessons, we learned that mpsz_sf contains some invalid geometries. We’ll address this by applying st_make_valid().

length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 6
mpsz_sf <- st_make_valid(mpsz_sf)

From manual inspection, I observed that bus_sf includes bus stops located outside of Singapore, such as LARKIN TER in Malaysia. To ensure data consistency, I’ll validate all spatial datasets in loc_fac_sfs, filtering them to include only points within Singapore’s boundaries.

Code
loc_fac_sfs_within_sg <- list()

for (name in names(loc_fac_sfs)) {
  
  sf_object <- loc_fac_sfs[[name]]
  
  # Filter points that are within Singapore boundary (mpsz_sf)
  sf_object_within <- sf_object[
    apply(st_within(sf_object, mpsz_sf, sparse = FALSE), 1, any), 
  ]
  
  loc_fac_sfs_within_sg[[name]] <- sf_object_within
}

After comparing loc_fac_sfs with loc_fac_sfs_within_sg, I found that five bus stops and one CHAS clinic were excluded, confirming these were located outside Singapore.
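One way to make that comparison is to diff row counts layer by layer. Below is a minimal sketch: count_excluded() is a hypothetical helper, and the toy data frames merely stand in for the sf layers in loc_fac_sfs and loc_fac_sfs_within_sg.

```r
# Hypothetical helper: features dropped per layer after the boundary filter
count_excluded <- function(before, after) {
  sapply(names(before), function(nm) nrow(before[[nm]]) - nrow(after[[nm]]))
}

# Toy stand-ins for the real sf layers
before <- list(bus = data.frame(id = 1:10), CHAS = data.frame(id = 1:5))
after  <- list(bus = data.frame(id = 1:5),  CHAS = data.frame(id = 1:4))
count_excluded(before, after)
#  bus CHAS
#    5    1
```

Run over the real lists, only layers with a non-zero difference need manual inspection.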

To ensure all datasets are compatible, we perform a final check on the coordinate reference system (CRS) for each item:

Code
# additional sanity checking all datasets before proceeding
for (i in loc_fac_sfs_within_sg) {
  print(st_crs(i))
}

Finally, loc_fac_sfs_within_sg is saved for future use:

write_rds(loc_fac_sfs_within_sg, "data/rds/loc_fac_sfs_within_sg.rds")

10 Visualization & Verification

In this section, we’ll plot each spatial dataset to examine the distribution of features, verify that they fall within Singapore’s boundaries, and assess whether their locations make sense.

loc_fac_sfs_within_sg <- read_rds("data/rds/loc_fac_sfs_within_sg.rds")
Code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE)

for (name in names(loc_fac_sfs_within_sg)) {
  
  sf_object <- loc_fac_sfs_within_sg[[name]]
  
  map <- tm_shape(mpsz_sf) +
    tm_polygons("REGION_N", alpha=0.4) +
    tm_shape(sf_object) +
    tm_dots(col = 'red', size = 0.02) +
    tm_layout(
      main.title = name, 
      main.title.position = "center"
    )
    print(map)
}

Note

During this visual inspection, we observed some point features in unexpected locations, such as supermarkets and CHAS clinics in remote areas like Tuas and Changi Airport. Manual verification suggests these are valid entities. Judging from the distribution of resale_geocoded_jit, however, let’s make a mental note that these distant point features may not be relevant or influential when computing proximity or facility counts around resale flats.

11 Feature Engineering

In this section, we will convert variables to usable formats (e.g., ordinal levels for floor range), calculate proximity to features like the CBD, and count nearby facilities to better capture the relevant structural and locational factors.

11.1 Floor Level

  1. Extract unique storey_range values and sort them.
  2. Create an ordinal variable storey_order based on the sorted storey_range.
resale_geocoded_jit_sf <- loc_fac_sfs_within_sg[["resale_geocoded_jit"]]
storeys <- sort(unique(resale_geocoded_jit_sf$storey_range))
storeys
 [1] "01 TO 03" "04 TO 06" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
 [7] "19 TO 21" "22 TO 24" "25 TO 27" "28 TO 30" "31 TO 33" "34 TO 36"
[13] "37 TO 39" "40 TO 42" "43 TO 45"
storey_order <- seq_along(storeys)
storey_range_order <- data.frame(storeys, storey_order)
resale_geocoded_jit_sf <- left_join(resale_geocoded_jit_sf,  storey_range_order, by = c("storey_range" = "storeys"))

11.2 Remaining Lease (Months)

Combine remaining_lease_yr and remaining_lease_mth into a single column, remaining_lease, calculated in months:

resale_geocoded_jit_sf <- resale_geocoded_jit_sf %>%
  mutate(
    remaining_lease = replace_na(remaining_lease_yr, 0) * 12 +
                      replace_na(remaining_lease_mth, 0)
  )  

summary(resale_geocoded_jit_sf$remaining_lease)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  551.0   823.5   911.0   914.9  1055.0  1145.0 

11.3 Age of Unit (Months)

Reverse-engineer the age based on a typical 99-year lease, subtracting the remaining_lease from 99 years in months:

resale_geocoded_jit_sf <- resale_geocoded_jit_sf %>%
  mutate(age = 99 * 12 - remaining_lease)

summary(resale_geocoded_jit_sf$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   43.0   133.0   277.0   273.1   364.5   637.0 

11.4 HDB Upgrade Program (HIP / MUP)

Load the HDB upgrade data and join it with resale_geocoded_jit_sf, adding hip and mup one-hot encoded flags to indicate whether each property has undergone the Home Improvement Program (HIP) or the Main Upgrading Program (MUP).

hdb_upgrade <- read_rds("data/rds/hdb_prog_geocoded.rds") %>%
  select(address, type)
resale_geocoded_jit_sf <- left_join(resale_geocoded_jit_sf, hdb_upgrade, by = c('address' = 'address'))
resale_geocoded_jit_sf <- resale_geocoded_jit_sf %>% mutate(hip = ifelse(is.na(type), 0, ifelse(type == "HIP", 1, 0)))
resale_geocoded_jit_sf <- resale_geocoded_jit_sf %>% mutate(mup = ifelse(is.na(type), 0, ifelse(type == "MUP", 1, 0)))

11.5 Proximity Distance Calculation

Note

The Proximity function is referenced from Megan’s work.

Code
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}

From this site, we can consider Downtown Core as the CBD location.

lat <- 1.287953
lng <- 103.851784

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
  st_transform(crs=3414)

The code block below calculates the distance from each HDB resale point to the nearest facility of each type (community clubs, libraries, parks, malls, etc.), generating 11 new proximity features in resale_geocoded_jit_sf:

Code
resale_geocoded_jit_sf <-
  proximity(resale_geocoded_jit_sf, cbd_sf, "PROX_CBD") %>%
  proximity(., loc_fac_sfs_within_sg[["communityclubs"]], "PROX_CC") %>%
  proximity(., loc_fac_sfs_within_sg[["libraries"]], "PROX_LIB") %>%  
  proximity(., loc_fac_sfs_within_sg[["sportsg"]], "PROX_SPORTS") %>%
  proximity(., loc_fac_sfs_within_sg[["eldercare"]], "PROX_ELDERCARE") %>%
  proximity(., loc_fac_sfs_within_sg[["natparks"]], "PROX_PARK") %>%  
  proximity(., loc_fac_sfs_within_sg[["hawker"]], "PROX_HAWKER") %>%  
  proximity(., loc_fac_sfs_within_sg[["train"]], "PROX_TRAIN") %>%  
  proximity(., loc_fac_sfs_within_sg[["supermarket"]], "PROX_SMKT") %>%  
  proximity(., loc_fac_sfs_within_sg[["good_sch"]], "PROX_GD_SCH") %>%
  proximity(., loc_fac_sfs_within_sg[["malls"]], "PROX_MALLS")

11.6 Nearby Facilities Count

Note

The nearby-facilities count function, num_radius, is from Megan’s work.

Code
num_radius <- function(df1, df2, varname, radius) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}
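The counting logic can be illustrated with a toy base-R example: comparing every entry of the distance matrix against the radius yields a logical matrix, and the row sums count how many facilities fall within the radius of each flat.

```r
# Toy distance matrix: 2 flats (rows) x 3 facilities (columns), in metres
dist_matrix <- matrix(c(100, 250, 800,
                        300, 400, 500), nrow = 2, byrow = TRUE)

# Count facilities within 350 m of each flat
rowSums(dist_matrix <= 350)  # 2 1
```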

The following code calculates the number of nearby facilities within specified distances for each HDB resale point, adding five new features to resale_geocoded_jit_sf:

Code
resale_geocoded_jit_sf <-
  num_radius(resale_geocoded_jit_sf, loc_fac_sfs_within_sg[["kindergarten"]], "NUM_KINDERGARTEN_350M", 350) %>%
  num_radius(., loc_fac_sfs_within_sg[["childcare"]], "NUM_CHILDCARE_350M", 350) %>%
  num_radius(., loc_fac_sfs_within_sg[["bus"]], "NUM_BUS_STOP_350M", 350) %>%
  num_radius(., loc_fac_sfs_within_sg[["sch"]], "NUM_PRI_SCH_1KM", 1000) %>%
  num_radius(., loc_fac_sfs_within_sg[["CHAS"]], "NUM_CLINIC_350M", 350)

Let’s save the fully feature-engineered dataset for future reference:

write_rds(resale_geocoded_jit_sf, "data/rds/resale_geocoded_jit_full_sf.rds")
resale_geocoded_jit_full_sf <- read_rds("data/rds/resale_geocoded_jit_full_sf.rds")

Next, we save a lite version containing only essential columns, removing unnecessary data fields like block numbers, street names, and postal codes:

resale_geocoded_jit_ft_sf <- resale_geocoded_jit_sf %>%
  select(-block, -street_name, -storey_range, -lease_commence_date, -remaining_lease_yr, -remaining_lease_mth, -BLK_NO, -ROAD_NAME, -POSTAL, -type)
write_rds(resale_geocoded_jit_ft_sf, "data/rds/resale_geocoded_jit_ft_sf.rds")

12 Train-Test Split

We now split the dataset into training and testing subsets based on the transaction month.

The datasets are saved as follows:

- resale_geocoded_jit_ft_sf: the full dataset with jitter applied and engineered features
- resale_geocoded_jit_ft_train_sf: the training subset (Jan–Dec 2023)
- resale_geocoded_jit_ft_test_sf: the testing subset (Jul–Sep 2024)
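Note that month is stored as a "YYYY-MM" character string; because this format is zero-padded, lexicographic comparison coincides with chronological order, so plain string comparisons suffice for the date filters. A quick base-R check:

```r
# "YYYY-MM" strings sort chronologically, so string comparison
# is sufficient for the train/test month filters
months <- c("2022-12", "2023-01", "2023-12", "2024-07", "2024-09")
months >= "2023-01" & months <= "2023-12"  # FALSE TRUE TRUE FALSE FALSE
```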

resale_geocoded_jit_ft_train_sf <- resale_geocoded_jit_ft_sf %>%
  filter(month >= "2023-01" & month <= "2023-12")

resale_geocoded_jit_ft_test_sf <- resale_geocoded_jit_ft_sf %>%
  filter(month >= "2024-07" & month <= "2024-09")


write_rds(resale_geocoded_jit_ft_train_sf, "data/rds/resale_geocoded_jit_ft_train_sf.rds")
write_rds(resale_geocoded_jit_ft_test_sf, "data/rds/resale_geocoded_jit_ft_test_sf.rds")

13 Exploratory Data Analysis

To prevent information leakage and reduce bias, we’ll conduct exploratory data analysis (EDA) on the training dataset only.

resale_train_sf <- read_rds("data/rds/resale_geocoded_jit_ft_train_sf.rds")

Let’s observe a quick summary of the train set:

summary(resale_train_sf)
    month               town            flat_type         floor_area_sqm 
 Length:5852        Length:5852        Length:5852        Min.   : 99.0  
 Class :character   Class :character   Class :character   1st Qu.:112.0  
 Mode  :character   Mode  :character   Mode  :character   Median :116.0  
                                                          Mean   :117.7  
                                                          3rd Qu.:122.0  
                                                          Max.   :153.0  
  flat_model        remaining_lease   resale_price       address         
 Length:5852        Min.   : 551.0   Min.   : 418000   Length:5852       
 Class :character   1st Qu.: 826.0   1st Qu.: 590000   Class :character  
 Mode  :character   Median : 914.0   Median : 650000   Mode  :character  
                    Mean   : 917.5   Mean   : 685044                     
                    3rd Qu.:1063.0   3rd Qu.: 740000                     
                    Max.   :1145.0   Max.   :1480000                     
          geometry     storey_order         age             hip        
 POINT        :5852   Min.   : 1.000   Min.   : 43.0   Min.   :0.0000  
 epsg:3414    :   0   1st Qu.: 2.000   1st Qu.:125.0   1st Qu.:0.0000  
 +proj=tmer...:   0   Median : 3.000   Median :274.0   Median :0.0000  
                      Mean   : 3.435   Mean   :270.5   Mean   :0.1077  
                      3rd Qu.: 4.000   3rd Qu.:362.0   3rd Qu.:0.0000  
                      Max.   :15.000   Max.   :637.0   Max.   :1.0000  
      mup            PROX_CBD        PROX_CC            PROX_LIB     
 Min.   :0.0000   Min.   : 1610   Min.   :   2.156   Min.   :  59.7  
 1st Qu.:0.0000   1st Qu.:11422   1st Qu.: 340.759   1st Qu.: 771.0  
 Median :0.0000   Median :13822   Median : 525.043   Median :1157.5  
 Mean   :0.0299   Mean   :13199   Mean   : 540.417   Mean   :1185.3  
 3rd Qu.:0.0000   3rd Qu.:16055   3rd Qu.: 712.111   3rd Qu.:1593.5  
 Max.   :1.0000   Max.   :19575   Max.   :1992.808   Max.   :2835.2  
  PROX_SPORTS      PROX_ELDERCARE       PROX_PARK       PROX_HAWKER     
 Min.   :  68.67   Min.   :   2.445   Min.   :  62.4   Min.   :  50.77  
 1st Qu.: 904.80   1st Qu.: 399.111   1st Qu.: 465.8   1st Qu.: 437.81  
 Median :1396.17   Median : 686.745   Median : 708.4   Median : 764.98  
 Mean   :1551.37   Mean   : 874.553   Mean   : 787.9   Mean   : 869.69  
 3rd Qu.:2067.69   3rd Qu.:1196.152   3rd Qu.:1008.8   3rd Qu.:1166.53  
 Max.   :4196.13   Max.   :3279.141   Max.   :2204.6   Max.   :2868.70  
   PROX_TRAIN        PROX_SMKT         PROX_GD_SCH        PROX_MALLS      
 Min.   :  22.55   Min.   :   1.953   Min.   :  79.89   Min.   :   2.607  
 1st Qu.: 278.85   1st Qu.: 181.593   1st Qu.:1152.81   1st Qu.: 361.147  
 Median : 480.25   Median : 279.522   Median :1948.38   Median : 546.031  
 Mean   : 561.12   Mean   : 300.074   Mean   :2138.98   Mean   : 626.413  
 3rd Qu.: 750.55   3rd Qu.: 395.668   3rd Qu.:2732.04   3rd Qu.: 798.139  
 Max.   :2127.62   Max.   :1455.808   Max.   :7156.54   Max.   :2178.680  
 NUM_KINDERGARTEN_350M NUM_CHILDCARE_350M NUM_BUS_STOP_350M NUM_PRI_SCH_1KM
 Min.   :0.000         Min.   : 0.000     Min.   : 0.000    Min.   :0.000  
 1st Qu.:0.000         1st Qu.: 3.000     1st Qu.: 6.000    1st Qu.:2.000  
 Median :1.000         Median : 5.000     Median : 8.000    Median :3.000  
 Mean   :1.055         Mean   : 4.921     Mean   : 8.222    Mean   :3.319  
 3rd Qu.:1.000         3rd Qu.: 6.000     3rd Qu.:10.000    3rd Qu.:4.000  
 Max.   :8.000         Max.   :22.000     Max.   :18.000    Max.   :9.000  
 NUM_CLINIC_350M 
 Min.   : 0.000  
 1st Qu.: 1.000  
 Median : 2.000  
 Mean   : 2.828  
 3rd Qu.: 4.000  
 Max.   :13.000  

To have a better sense of the data, we can visualize the distribution of data with a histogram:

Code
fill_col <- "#3498db"

ggplotly(
  ggplot(resale_train_sf, aes(x = resale_price)) +
    geom_histogram(bins = 30, color="black",fill = fill_col) +
    ggtitle("Distribution of Resale Prices") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))
)
Note

Observations

The histogram shows the distribution of resale prices for 5-room flats in the training set (Jan–Dec 2023).

The prices span from around SGD 400,000 to nearly SGD 1.5 million, and the distribution is right-skewed, with a tail extending towards prices above SGD 1 million.

We can also visualize the spatial distribution of 5-room resale flat prices using tmap.

Code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE) 

tm_shape(mpsz_sf) +
  tm_polygons("REGION_N", alpha=0.4) +
  tm_shape(resale_train_sf) +  
  tm_dots(
    col = "resale_price",
    alpha = 0.5,
    style = "quantile",
    size = 0.1,
    title = "Resale Price (SGD)"
  ) +
  tm_layout(
    outer.margins = c(0.1, 0, 0, 0),
    main.title = "Spatial Distribution of Resale Prices for 5-Room Flats",
    main.title.position = "center",
    main.title.size = 1.5,
    legend.position = c("left", "bottom"),
    legend.title.size = 1.2,
    legend.text.size = 0.8,
    frame = TRUE,
    inner.margins = c(0.05, 0.05, 0.05, 0.05)
  )

Note

Observation

This map shows the spatial distribution of resale prices for 5-room flats in Singapore, with darker shades indicating higher prices.

High resale prices are concentrated in the central and southern downtown areas, as well as in Toa Payoh and Bishan. High prices can also be observed in the north-east, in the Punggol area, likely due to the younger age of flats there.

In contrast, the western and northern regions, such as Jurong and Woodlands, offer more affordable resale options.

13.1 Visualizing Structural Factors

To examine the structural characteristics of the dataset, we will categorize and visualize the structural factors in two groups: categorical and numerical.

First, we identify and group the structural factors:

struct_factors_cat <- c("mup", "hip")
struct_factors_num <- c("floor_area_sqm", "storey_order", "remaining_lease", "age")

Then, we create a plot_factors function to visualize these factors. This function takes in the dataset, a list of factors to plot, and layout specifications. Based on whether the factor is categorical or numerical, it creates either bar charts or histograms.

Code
plot_factors <- function(data, factors, ncol = 2, nrow = 2, is_categorical = FALSE) {
  
  # create plots depending on categorical or numerical
  plots <- lapply(factors, function(var) {
    if (is_categorical) {
      ggplot(data, aes_string(x = var)) +
        geom_bar(color = "black", fill = fill_col) +
        ggtitle(paste("Dist. of", var))
    } else {
      ggplot(data, aes_string(x = var)) +
        geom_histogram(bins = 20, color = "black", fill = fill_col) +
        ggtitle(paste("Dist. of", var))
    }
  })
  
  ggarrange(plotlist = plots, ncol = ncol, nrow = nrow)
}
Code
plot_factors(resale_train_sf, struct_factors_cat, ncol = 2, nrow = 1, is_categorical = TRUE)

Note

Observations

The bar charts show the distribution of the mup and hip variables across the dataset:

  1. MUP (Main Upgrading Program):
    • The majority of records do not have the MUP completed, as indicated by the large count for mup = 0.
    • Only a small fraction of records have MUP completed (mup = 1), suggesting that this upgrade program is relatively rare in the dataset.
  2. HIP (Home Improvement Program):
    • Similar to MUP, most records do not have the HIP completed, with a large count for hip = 0.
    • However, the count of hip = 1 (indicating HIP completion) is slightly higher than that of MUP.

This distribution indicates that most of the resale flats in the dataset have not undergone these upgrading programs. Let’s keep this in mind.

Code
plot_factors(resale_train_sf, struct_factors_num, ncol = 2, nrow = 2, is_categorical = FALSE)

Note

Observations

The figures above show the distribution of the numerical structural factors for the train set:

  1. Floor Area (sqm):
    • The majority of HDB units have a floor area between 100 and 120 sqm. (which is expected)
  2. Storey Order:
    • Note that x-axis for this histogram is the storey range
    • Interestingly, most units transacted are located on lower storey range.
  3. Remaining Lease:
    • The distribution of remaining lease values reveals two prominent peaks: one around 900 months (approximately 75 years remaining) and another near 1150 months (around 95 years remaining).
    • This suggests that a significant portion of the 5-room resale transactions in 2023 involve either units around 20–25 years old or units that have just met the Minimum Occupation Period (MOP).
  4. Age:
    • The age distribution is inversely related to the remaining lease.

13.2 Visualizing Locational Factors

loc_factors <- c("PROX_CBD", "PROX_CC", "PROX_LIB", "PROX_SPORTS",
               "PROX_ELDERCARE", "PROX_PARK", "PROX_HAWKER", "PROX_TRAIN", "PROX_SMKT",
               "PROX_GD_SCH", "PROX_MALLS", "NUM_KINDERGARTEN_350M", "NUM_CHILDCARE_350M",
               "NUM_BUS_STOP_350M", "NUM_PRI_SCH_1KM", "NUM_CLINIC_350M")
plot_factors(resale_train_sf, loc_factors, ncol = 4, nrow = 4, is_categorical = FALSE)

Note

Observations

Proximity Variables (PROX)

  • General Skewness: Most proximity-based location factors (e.g., PROX_CC, PROX_ELDERCARE, PROX_PARK) show right-skewed distributions.

  • Proximity to CBD (PROX_CBD): This variable is slightly left-skewed, indicating that most units are further from the CBD, with fewer closer to the city center.

  • Common Amenities: Proximity to facilities such as Community Centers, Libraries, Malls and Sports Centers generally peaks around 500–1000 meters, showing moderate accessibility for most flats.

Number of Facility Counts within Radius (NUM)

  • NUM_KINDERGARTEN_350M, NUM_CLINIC_350M: These factors show a strong right skew, with most units having between 0 and 3 facilities within 350 meters.

  • NUM_CHILDCARE_350M: The distribution shows a slight right skew, with most units having between 0-5 childcare centers within 350 meters.

  • NUM_BUS_STOP_350M: The distribution is more balanced, with counts mostly ranging between 5-10 bus stops within 350 meters. This indicates relatively good access to bus stops, which is essential for public transport connectivity.

13.3 Visualizing Train-Test Split

To understand the distribution of towns in the training and test datasets and to check whether the split is roughly stratified, we count the transactions by town in each split and compare them side by side.

Code
resale_test_sf <- read_rds("data/rds/resale_geocoded_jit_ft_test_sf.rds")
train_counts <- as.data.frame(table(resale_train_sf$town))
colnames(train_counts) <- c("town", "train")

test_counts <- as.data.frame(table(resale_test_sf$town))
colnames(test_counts) <- c("town", "test")

town_counts <- full_join(train_counts, test_counts, by = "town") %>%
  replace_na(list(train = 0, test = 0)) %>%
  arrange(desc(train + test))

t_t_split_plot <- plot_ly(town_counts, y = ~town) %>%
  add_trace(x = ~train, name = "Train", type = "bar", orientation = "h") %>%
  add_trace(x = ~test, name = "Test", type = "bar", orientation = "h") %>%
  layout(
    title = list(text = "Town Distribution in Train vs Test Splits"),
    xaxis = list(title = "Count"),
    yaxis = list(title = "Town", categoryorder = "array", categoryarray = ~town),
    barmode = "group"
  )

t_t_split_plot
Note

Observations

Overall, the ratio between train and test data appears relatively consistent across towns.

Sengkang and Punggol have the highest number of data points for both train and test sets, suggesting a higher volume of resale transactions or data availability in these towns. This is reasonable, as they are young towns with newer flats, many of which recently met their MOP.

Central Area and Bukit Timah have the fewest data points, which might suggest that a smaller number of HDB flats is available in the resale market there, or that they are simply too expensive.

14 Multicollinearity Check

Multicollinearity can affect the stability and interpretability of a regression model. To identify potential multicollinearity, we will use ggcorrmat() of ggstatsplot to plot a correlation matrix to check if there are pairs of highly correlated independent variables.

Variables with absolute correlations above 0.5 are potential candidates for removal or further analysis, depending on their impact on the model.
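As a side note, such pairs can also be flagged programmatically. The base-R sketch below is hypothetical and uses synthetic data (the variable names a, b, c are made up); it keeps only the lower triangle of the correlation matrix so each pair is reported once.

```r
# Flag variable pairs with |correlation| above a threshold (synthetic data)
set.seed(42)
x  <- rnorm(100)
df <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))

cm <- cor(df)
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair once
high <- which(abs(cm) > 0.5, arr.ind = TRUE)  # NA entries are ignored

data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           corr = cm[high])
```

Here only the (b, a) pair should be flagged, since c was generated independently.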

resale_nogeo <- resale_train_sf %>%
  st_drop_geometry()

corr_list <- c("floor_area_sqm", "remaining_lease", "storey_order", "age",
               "hip", "mup", "PROX_CBD", "PROX_CC", "PROX_LIB", "PROX_SPORTS",
               "PROX_ELDERCARE", "PROX_PARK", "PROX_HAWKER", "PROX_TRAIN", "PROX_SMKT",
               "PROX_GD_SCH", "PROX_MALLS", "NUM_KINDERGARTEN_350M", "NUM_CHILDCARE_350M",
               "NUM_BUS_STOP_350M", "NUM_PRI_SCH_1KM", "NUM_CLINIC_350M")
  

ggstatsplot::ggcorrmat(resale_nogeo, corr_list)

Note

Observations

From the matrix, we observe that:

  • Age and Remaining Lease are perfectly negatively correlated (-1), meaning they carry the same information in opposite directions. We will retain only one of these variables (remove Age) to avoid redundancy.
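This perfect correlation is by construction: age was defined as 99 * 12 - remaining_lease, an exact linear transform, so the two correlate at exactly -1. A minimal base-R check with illustrative values:

```r
# age is a deterministic linear transform of remaining_lease,
# so their correlation is exactly -1
remaining_lease <- c(551, 823, 911, 1055, 1145)  # sample values (months)
age <- 99 * 12 - remaining_lease
cor(age, remaining_lease)  # -1
```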

15 OLS Multiple Linear Regression Model

We will move on to build our first model. We fit a multiple linear regression model to predict resale_price based on various structural and locational factors. The ols_regress function from olsrr is used to evaluate the model.

resale_mlr <- lm(formula = resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
               hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
               PROX_ELDERCARE+ PROX_PARK+ PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
               PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M+
               NUM_BUS_STOP_350M+ NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
                 data = resale_train_sf)

olsrr::ols_regress(resale_mlr)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.864       RMSE                    71124.152 
R-Squared                   0.747       MSE                5077734242.025 
Adj. R-Squared              0.746       Coef. Var                  10.402 
Pred R-Squared              0.745       AIC                    147412.478 
MAE                     53128.738       SBC                    147565.992 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                    ANOVA                                      
------------------------------------------------------------------------------
                    Sum of                                                    
                   Squares          DF       Mean Square       F         Sig. 
------------------------------------------------------------------------------
Regression    8.739137e+13          21      4.161494e+12    819.557    0.0000 
Residual      2.960319e+13        5830    5077734242.025                      
Total         1.169946e+14        5851                                        
------------------------------------------------------------------------------

                                               Parameter Estimates                                                 
------------------------------------------------------------------------------------------------------------------
                model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
------------------------------------------------------------------------------------------------------------------
          (Intercept)    -297812.438     28784.167                 -10.346    0.000    -354240.085    -241384.792 
       floor_area_sqm       6715.327       178.883        0.339     37.540    0.000       6364.650       7066.003 
      remaining_lease        581.129        10.043        0.600     57.866    0.000        561.441        600.816 
         storey_order      19907.859       503.598        0.280     39.531    0.000      18920.621      20895.098 
                  hip      21187.338      3586.425        0.046      5.908    0.000      14156.614      28218.061 
                  mup      15307.484      6383.041        0.018      2.398    0.017       2794.356      27820.611 
             PROX_CBD        -21.028         0.326       -0.585    -64.459    0.000        -21.668        -20.389 
              PROX_CC        -29.645         4.315       -0.054     -6.871    0.000        -38.103        -21.187 
             PROX_LIB        -31.941         2.150       -0.121    -14.857    0.000        -36.156        -27.727 
          PROX_SPORTS         11.558         1.494        0.069      7.735    0.000          8.629         14.488 
       PROX_ELDERCARE         -4.585         1.859       -0.021     -2.466    0.014         -8.230         -0.941 
            PROX_PARK         -1.503         2.534       -0.004     -0.593    0.553         -6.470          3.464 
          PROX_HAWKER        -17.719         2.228       -0.069     -7.953    0.000        -22.087        -13.352 
           PROX_TRAIN        -37.591         2.988       -0.099    -12.579    0.000        -43.449        -31.732 
            PROX_SMKT         22.426         6.755        0.025      3.320    0.001          9.183         35.669 
          PROX_GD_SCH         -4.684         0.787       -0.046     -5.950    0.000         -6.227         -3.140 
           PROX_MALLS        -13.517         3.046       -0.035     -4.437    0.000        -19.488         -7.545 
NUM_KINDERGARTEN_350M       8438.655      1025.462        0.064      8.229    0.000       6428.370      10448.941 
   NUM_CHILDCARE_350M      -4579.307       500.711       -0.077     -9.146    0.000      -5560.886      -3597.728 
    NUM_BUS_STOP_350M       -106.480       342.324       -0.002     -0.311    0.756       -777.561        564.602 
      NUM_PRI_SCH_1KM     -14934.129       742.115       -0.174    -20.124    0.000     -16388.949     -13479.308 
      NUM_CLINIC_350M       6868.982       527.562        0.101     13.020    0.000       5834.764       7903.200 
------------------------------------------------------------------------------------------------------------------
Code
summary(resale_mlr)
Note

Observations

The R-Squared value indicates that the model explains approximately 74.7% of the variance in resale prices, suggesting a strong fit.

Based on the high p-values (> 0.05), the following variables are not statistically significant and may not meaningfully contribute to the model:

  • PROX_PARK (p = 0.553)
  • NUM_BUS_STOP_350M (p = 0.756)

Removing these variables can simplify the model without sacrificing predictive power. We’ll further examine variable importance in later steps.

Now, we will build another model without the variables mentioned above:

resale_mlr2 <- lm(formula = resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
               hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
               PROX_ELDERCARE 
               #+ PROX_PARK
               + PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
               PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
               # +NUM_BUS_STOP_350M
               + NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
                 data = resale_train_sf)


olsrr::ols_regress(resale_mlr2)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.864       RMSE                    71126.953 
R-Squared                   0.747       MSE                5076392689.525 
Adj. R-Squared              0.746       Coef. Var                  10.401 
Pred R-Squared              0.745       AIC                    147408.939 
MAE                     53120.877       SBC                    147549.104 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                    ANOVA                                      
------------------------------------------------------------------------------
                    Sum of                                                    
                   Squares          DF       Mean Square       F         Sig. 
------------------------------------------------------------------------------
Regression    8.738904e+13          19      4.599423e+12    906.042    0.0000 
Residual      2.960552e+13        5832    5076392689.525                      
Total         1.169946e+14        5851                                        
------------------------------------------------------------------------------

                                               Parameter Estimates                                                 
------------------------------------------------------------------------------------------------------------------
                model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
------------------------------------------------------------------------------------------------------------------
          (Intercept)    -297334.543     28237.914                 -10.530    0.000    -352691.326    -241977.761 
       floor_area_sqm       6705.168       177.163        0.339     37.847    0.000       6357.862       7052.474 
      remaining_lease        580.840         9.980        0.599     58.202    0.000        561.276        600.403 
         storey_order      19894.393       503.072        0.280     39.546    0.000      18908.185      20880.601 
                  hip      21452.539      3553.934        0.047      6.036    0.000      14485.510      28419.568 
                  mup      15672.686      6358.226        0.019      2.465    0.014       3208.205      28137.168 
             PROX_CBD        -21.102         0.308       -0.587    -68.576    0.000        -21.705        -20.498 
              PROX_CC        -29.653         4.284       -0.054     -6.922    0.000        -38.051        -21.255 
             PROX_LIB        -31.894         2.121       -0.120    -15.039    0.000        -36.051        -27.736 
          PROX_SPORTS         11.460         1.473        0.069      7.781    0.000          8.573         14.348 
       PROX_ELDERCARE         -4.361         1.803       -0.020     -2.418    0.016         -7.896         -0.826 
          PROX_HAWKER        -17.539         2.211       -0.068     -7.934    0.000        -21.872        -13.205 
           PROX_TRAIN        -37.834         2.962       -0.100    -12.771    0.000        -43.641        -32.026 
            PROX_SMKT         22.892         6.719        0.026      3.407    0.001          9.720         36.063 
          PROX_GD_SCH         -4.703         0.785       -0.046     -5.989    0.000         -6.242         -3.164 
           PROX_MALLS        -13.515         3.022       -0.035     -4.473    0.000        -19.439         -7.592 
NUM_KINDERGARTEN_350M       8485.827      1022.839        0.065      8.296    0.000       6480.684      10490.970 
   NUM_CHILDCARE_350M      -4600.690       494.407       -0.078     -9.305    0.000      -5569.911      -3631.469 
      NUM_PRI_SCH_1KM     -14988.484       736.682       -0.175    -20.346    0.000     -16432.654     -13544.313 
      NUM_CLINIC_350M       6839.424       525.413        0.100     13.017    0.000       5809.419       7869.428 
------------------------------------------------------------------------------------------------------------------
Note

Observations

Compared to the earlier model, the R-squared value remains the same. By removing the variables that were not statistically significant in the previous model, we have simplified the model without sacrificing explanatory power.

We will save this model for further analysis:

write_rds(resale_mlr2, "data/rds/resale_mlr2.rds")

15.1 Test for Linear Regression Assumptions

Referencing In-Class Exercise 07, we will run a few more tests on the resale_mlr2 model.

15.1.1 Multicollinearity Check with VIF

Variance Inflation Factor (VIF) analysis is conducted to detect the presence of multicollinearity among the predictors. High VIF values suggest redundancy, indicating that some predictors might need to be removed.
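For intuition: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The base-R sketch below demonstrates this on synthetic data (the variables x1, x2, x3 are made up for illustration), rather than reproducing what check_collinearity() does internally.

```r
# VIF by hand: regress one predictor on the others; VIF = 1 / (1 - R^2)
set.seed(123)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.6)  # correlated with x1
x3 <- rnorm(200)                       # independent noise

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # > 1; grows as x1 becomes more predictable from the others
```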

vif <- performance::check_collinearity(resale_mlr2)

kable(vif,
      caption = "Variance Inflation Factor (VIF) Results") %>%
  kable_styling(font_size = 18)
Variance Inflation Factor (VIF) Results
Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high
floor_area_sqm 1.844959 1.779020 1.916478 1.358292 0.5420176 0.5217904 0.5621072
remaining_lease 2.444561 2.348634 2.547311 1.563509 0.4090715 0.3925709 0.4257794
storey_order 1.153047 1.123093 1.190291 1.073800 0.8672674 0.8401310 0.8903985
hip 1.398739 1.355403 1.447358 1.182683 0.7149297 0.6909139 0.7377877
mup 1.351977 1.311067 1.398267 1.162746 0.7396574 0.7151708 0.7627373
PROX_CBD 1.689703 1.631571 1.753185 1.299886 0.5918201 0.5703906 0.6129062
PROX_CC 1.386885 1.344162 1.434911 1.177661 0.7210405 0.6969075 0.7439580
PROX_LIB 1.477513 1.430134 1.530110 1.215530 0.6768132 0.6535478 0.6992352
PROX_SPORTS 1.799846 1.736173 1.869027 1.341584 0.5556030 0.5350378 0.5759794
PROX_ELDERCARE 1.589665 1.536586 1.647995 1.260819 0.6290633 0.6067980 0.6507934
PROX_HAWKER 1.703971 1.645120 1.768190 1.305362 0.5868646 0.5655502 0.6078584
PROX_TRAIN 1.412188 1.368159 1.461482 1.188355 0.7081212 0.6842369 0.7309093
PROX_SMKT 1.316993 1.277916 1.361564 1.147603 0.7593056 0.7344496 0.7825240
PROX_GD_SCH 1.363063 1.321576 1.409903 1.167503 0.7336416 0.7092687 0.7566722
PROX_MALLS 1.406714 1.362968 1.455734 1.186050 0.7108764 0.6869387 0.7336932
NUM_KINDERGARTEN_350M 1.399690 1.356305 1.448357 1.183085 0.7144440 0.6904376 0.7372971
NUM_CHILDCARE_350M 1.601359 1.547688 1.660290 1.265448 0.6244695 0.6023044 0.6461250
NUM_PRI_SCH_1KM 1.707058 1.648052 1.771436 1.306544 0.5858033 0.5645137 0.6067771
NUM_CLINIC_350M 1.363316 1.321816 1.410168 1.167611 0.7335054 0.7091352 0.7565349

To visualize the results:

plot(vif) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note

Observations

The plot above shows the Variance Inflation Factor (VIF) values for each predictor in the resale_mlr2 model, indicating the degree of multicollinearity among them.

The variance inflation factor (VIF) quantifies the extent of correlation between one predictor and the other predictors in a model; the higher the value, the greater that correlation. Our plot shows that all predictors have VIF values below 5, suggesting that multicollinearity is not a significant issue in this model.

remaining_lease (2.44) and floor_area_sqm (1.84) have the highest VIFs but are still well within acceptable limits.

15.1.2 Test for Non-Linearity

In multiple linear regression, it is important to test the assumption of linearity and additivity of the relationship between the dependent and independent variables.

To test the linearity assumption, we use the ols_plot_resid_fit() function.

ols_plot_resid_fit(resale_mlr2)

Note

Observations

Most of our data points are clustered around the zero line, with a few outliers. These deviations are within an acceptable range, allowing us to conclude that the relationships between the dependent and independent variables are reasonably linear.

15.1.3 Test for Normality Assumption

We can assess the normality of residuals with a histogram:

ols_plot_resid_hist(resale_mlr2)

Note

Observations:

The figure reveals that the residuals of resale_mlr2 resemble a normal distribution.

For a formal test, we can use ols_test_normality():

Unlike the class exercise, we need to sample a subset of residuals, as these tests have a sample-size limit of 5,000. To ensure reproducibility, a seed is set before sampling.
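Why the seed matters: sample() draws a random subset, so without a fixed seed the test statistics would change on every render. A minimal base-R illustration:

```r
# Fixing the seed makes the random subsample reproducible
set.seed(1234)
s1 <- sample(1:10000, size = 5000)

set.seed(1234)
s2 <- sample(1:10000, size = 5000)

identical(s1, s2)  # TRUE: the same subset is drawn each time
```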

set.seed(1234)

ols_test_normality(sample(residuals(resale_mlr2), size = 5000))
-----------------------------------------------
       Test             Statistic       pvalue  
-----------------------------------------------
Shapiro-Wilk              0.9706         0.0000 
Kolmogorov-Smirnov        0.0463         0.0000 
Cramer-von Mises         417.5645        0.0000 
Anderson-Darling         21.1864         0.0000 
-----------------------------------------------
Note

Observations:

The summary table above reveals that the p-values of all four tests are far smaller than the alpha value of 0.05. Hence, we reject the null hypothesis and infer that there is statistical evidence that the residuals are not normally distributed.

15.1.4 Test for Spatial Autocorrelation

Since the hedonic model involves geographically referenced data, it is important to visualize the residuals and test for spatial autocorrelation.

First, we export the residuals from the hedonic pricing model into a data frame and join it with the spatial dataframe.

mlr_res <- as.data.frame(resale_mlr2$residuals)
resale_res_sf <- cbind(resale_train_sf, 
                        resale_mlr2$residuals) %>%
  rename(`MLR_RES` = `resale_mlr2.residuals`)

Using tmap, we visualize the spatial distribution of the residuals:

Code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE) 

tm_shape(mpsz_sf) +
  tm_polygons("REGION_N",alpha = 0.4) +
  tm_shape(resale_res_sf) +  
  tm_dots(
    col = "MLR_RES",
    alpha = 0.6,
    style = "quantile",
    size = 0.2,
    title = "MLR Residuals"
  ) +
  tm_layout(
    outer.margins = c(0.1, 0, 0, 0),
    main.title = "Multiple Linear Regression Residuals",
    main.title.position = "center",
    main.title.size = 1.5,
    legend.position = c("left", "bottom"),
    legend.title.size = 1.2,
    legend.text.size = 0.8,
    frame = TRUE,
    inner.margins = c(0.05, 0.05, 0.05, 0.05)
  )

Note

Observations

The plot above shows indications of spatial autocorrelation, which we need to validate statistically.

To confirm this observation, we will perform the Moran’s I test.

\(H_o\): The residuals are randomly distributed (i.e., spatially stationary).

\(H_1\): The residuals are spatially non-stationary.
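For reference, the global Moran’s I statistic for residuals \(x_i\) with spatial weights \(w_{ij}\) (a standard formulation, not specific to sfdep’s implementation) is:

\[
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}
\]

Values of \(I\) well above its expected value of \(-1/(n-1)\) indicate clustering of similar residuals, while values below it indicate dispersion.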

First, we compute a k-nearest neighbours (k = 6) spatial weights matrix using the st_knn() and st_weights() functions of sfdep.

resale_res_sf <- resale_res_sf %>%
  mutate(nb = st_knn(geometry, k=6,
                     longlat = FALSE),
         wt = st_weights(nb,
                         style = "W"),
         .before = 1)

Next, global_moran_perm() of sfdep is used to perform global Moran permutation test. We set a seed for reproducibility.

set.seed(1234)
mlr_moran_perm <- global_moran_perm(resale_res_sf$MLR_RES, 
                  resale_res_sf$nb, 
                  resale_res_sf$wt, 
                  alternative = "two.sided", 
                  nsim = 99)

Then we save the result for future analysis:

write_rds(mlr_moran_perm, "data/rds/mlr_moran_perm.rds")
mlr_moran_perm <- read_rds("data/rds/mlr_moran_perm.rds")
mlr_moran_perm

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.71167, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Note

Observations:

The Global Moran’s I permutation test on the residuals reports a p-value of less than 2.2e-16, which is below the alpha value of 0.05. Hence, we reject the null hypothesis that the residuals are randomly distributed.

Since the observed Global Moran’s I = 0.71167 is greater than 0, we can infer that the residuals exhibit a clustered distribution.

16 Random Forest (RF)

Referencing In-Class Exercise 08, we proceed to build our RF model with ranger.

Since the SpatialML package is based on the ranger package, coordinate data must be prepared before calibration.

resale_test_sf <- read_rds("data/rds/resale_geocoded_jit_ft_test_sf.rds")
coords <- st_coordinates(resale_sf)
coords_train <- st_coordinates(resale_train_sf)
coords_test <- st_coordinates(resale_test_sf)

write_rds(coords, "data/rds/coords.rds")
write_rds(coords_train, "data/rds/coords_train.rds")
write_rds(coords_test, "data/rds/coords_test.rds")

Additionally, the geometry field is removed:

train_nogeo <- resale_train_sf %>% 
  st_drop_geometry()

16.1 Calibrating RF Model

When calibrating the Random Forest (RF) model, we reduce num.trees from the typical 500 to 50 to decrease model complexity and computation time.

Additionally, we set importance = "impurity" to calculate variable importance based on the reduction in node impurity, allowing us to understand which predictors most strongly impact the model’s decisions. This will be analyzed collectively later on.

Note: we set a seed for reproducibility.

set.seed(1234)
rf <- ranger(resale_price ~  floor_area_sqm+ remaining_lease+ storey_order+
               hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
               PROX_ELDERCARE 
               #+ PROX_PARK
               + PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
               PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
               # +NUM_BUS_STOP_350M
               + NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
             data=train_nogeo,
             num.trees = 50,
             importance="impurity")             

As usual, we save the model for downstream model evaluation:

write_rds(rf, "data/rds/rf.rds")
rf <- read_rds("data/rds/rf.rds")
rf
Ranger result

Call:
 ranger(resale_price ~ floor_area_sqm + remaining_lease + storey_order +      hip + mup + PROX_CBD + PROX_CC + PROX_LIB + PROX_SPORTS +      PROX_ELDERCARE + PROX_HAWKER + PROX_TRAIN + PROX_SMKT + PROX_GD_SCH +      PROX_MALLS + NUM_KINDERGARTEN_350M + NUM_CHILDCARE_350M +      NUM_PRI_SCH_1KM + NUM_CLINIC_350M, data = train_nogeo, num.trees = 50,      importance = "impurity") 

Type:                             Regression 
Number of trees:                  50 
Sample size:                      5852 
Number of independent variables:  19 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       1724866962 
R squared (OOB):                  0.9137379 
Note

Observations

Compared to the resale_mlr2 model, rf has a higher OOB R-squared (0.91), indicating that the model explains a substantial portion of the variance in resale prices.

Further model evaluation will be explored later on.

17 Geographically Weighted Random Forest (gwRF)

Geographically Weighted Random Forest (gwRF) is an extension of the traditional Random Forest (RF) model that accounts for spatial heterogeneity in the data.

While a standard RF model generates predictions based solely on global patterns across all observations, gwRF introduces localized weights based on spatial coordinates. This allows it to capture variations in relationships between predictors and the target variable across different geographic locations.
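Conceptually, a gwRF prediction blends the output of a location-specific local model with that of the global RF model. A minimal sketch of this blending step (illustrative only, not the SpatialML internals; blend_prediction is a hypothetical helper mirroring the local.w and global.w arguments of predict.grf()):

```r
# Hypothetical helper: combine a local model's prediction with the global
# RF's prediction using user-set weights, as predict.grf() allows via
# local.w and global.w. Not the actual SpatialML implementation.
blend_prediction <- function(local_pred, global_pred, local.w = 1, global.w = 0) {
  local.w * local_pred + global.w * global_pred
}

blend_prediction(650000, 700000, local.w = 1,   global.w = 0)    # purely local: 650000
blend_prediction(650000, 700000, local.w = 0.5, global.w = 0.5)  # 50/50 blend: 675000
```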

17.1 Calibrating Geographically Weighted Random Forest (gwRF)

Before we calibrate the gwRF model, we first calculate the optimal bandwidth for gwRF model. This step is essential to account for local variations in the relationship between predictors and the target variable (resale_price) across different geographic locations.

In this case, we use cross-validation to select the optimal bandwidth, which helps ensure that the model generalizes well by finding a balance between model fit and complexity.

bw_adaptive <- bw.gwr(resale_price ~  floor_area_sqm+ remaining_lease+ storey_order+
               hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
               PROX_ELDERCARE 
               #+ PROX_PARK
               + PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
               PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
               # +NUM_BUS_STOP_350M
               + NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
               data=resale_res_sf,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)

We save the output for reference:

write_rds(bw_adaptive, "data/rds/bw_adaptive.rds")
bw_adaptive <- read_rds("data/rds/bw_adaptive.rds")
bw_adaptive
[1] 72
Note

Observations

The optimal adaptive bandwidth is 72, i.e., each local model draws on the 72 nearest neighbouring observations (since adaptive = TRUE, the bandwidth is a neighbour count rather than a fixed distance).

Then we calibrate a gwRF model using the previously calculated adaptive bandwidth (bw_adaptive) to set the range of influence for each data point in the geographic weighting.

Note: we set a seed for reproducibility.

set.seed(1234)

gwRF_adaptive <- grf(formula = resale_price ~  floor_area_sqm+ remaining_lease+ storey_order+
               hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
               PROX_ELDERCARE 
               #+ PROX_PARK
               + PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
               PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
               # +NUM_BUS_STOP_350M
               + NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
                     dframe=train_nogeo, 
                     bw = bw_adaptive,
                     ntree = 50,
                     kernel="adaptive",
                     verbose = TRUE,
                     coords=coords_train)

We save the output for reference:

write_rds(gwRF_adaptive, "data/rds/gwRF_adaptive.rds")

18 Model Evaluation

In this section, we evaluate three models—Multiple Linear Regression (MLR), Random Forest (RF), and Geographically Weighted Random Forest (GWRF)—using the test dataset. We load each saved model and generate predictions to assess their performance. These models are trained using the same set of variables to ensure a fair comparison.

model_mlr <- read_rds("data/rds/resale_mlr2.rds")
model_rf <- read_rds("data/rds/rf.rds")
model_gwrf <- read_rds("data/rds/gwRF_adaptive.rds")

18.1 Generating Predictions

  1. Multiple Linear Regression (MLR): We predict resale prices using model_mlr on the test dataset.
resale_test_sf$pred_mlr <- predict(object = model_mlr, newdata = resale_test_sf)
  2. Random Forest (RF): For RF predictions, we remove geometry data from the test dataset (test_nogeo) to ensure compatibility with model_rf.
test_nogeo <- resale_test_sf %>%
  st_drop_geometry()

rf_pred <- predict(model_rf, data = test_nogeo)

resale_test_sf$pred_rf <- rf_pred$predictions
  3. Geographically Weighted Random Forest (gwRF): The gwRF model requires both the test dataset and coordinates (coords_test) to incorporate spatial variability into predictions.
gwRF_test_data <- cbind(test_nogeo, coords_test) 

gwRF_pred <- predict.grf(model_gwrf, 
                         gwRF_test_data, 
                         x.var.name="X",
                         y.var.name="Y", 
                         local.w=1,
                         global.w=0)

resale_test_sf$pred_grf <- gwRF_pred

Then we save the output for reference:

write_rds(resale_test_sf, "data/rds/model_eval.rds")

18.2 Visualize Predictions by Model

Building on the previous section, we now visualize the predicted resale prices against the actual resale prices for three different models: Multiple Linear Regression (MLR), Random Forest (RF), and Geographically Weighted Random Forest (gwRF).

Ideally, the points should align along a straight line, indicating that the predictions closely match the actual values. Any deviation from this line suggests that the model is either overestimating or underestimating resale prices in certain ranges.

model_eval <- read_rds("data/rds/model_eval.rds")

Using a custom plot_pred function, we can visualize the scatterplots side by side:

Code
plot_pred <- function(data, x, y, title) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point() +
    ggtitle(title) +
    theme(plot.title = element_text(hjust = 0.5))
}
Code
plot_mlr <- plot_pred(model_eval, "pred_mlr", "resale_price", "[MLR] Actual vs Pred")
plot_rf <- plot_pred(model_eval, "pred_rf", "resale_price", "[RF] Actual vs Pred")
plot_grf <- plot_pred(model_eval, "pred_grf", "resale_price", "[gwRF] Actual vs Pred")

ggarrange(plot_mlr, plot_rf, plot_grf, ncol = 3, nrow = 1)

Note

Observations

  • Multiple Linear Regression (MLR): The predicted values show a general trend aligning with the actual values, but there is noticeable spread as compared to other models.
  • Random Forest (RF): This model produces a more compact scatter, showing better alignment with actual values, particularly in the mid-range prices.
  • Geographically Weighted Random Forest (gwRF): The predictions appear similar to RF.

While the visualizations suggest some differences in performance, it’s challenging to determine which model is superior from these plots alone. We will further evaluate the models using RMSE and MAE for a clearer assessment.

18.3 Quantitative Metrics Assessment

The code below calculates the RMSE (Root Mean Square Error) and MAE (Mean Absolute Error) for the three models and formats the results neatly using kable.
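For reference, the two metrics can be computed in base R as follows (rmse_manual and mae_manual are hypothetical helpers on toy vectors, equivalent in form to the rmse() and mae() functions used below):

```r
# Base-R definitions of the two metrics (toy vectors for illustration,
# not the actual test-set predictions).
rmse_manual <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae_manual  <- function(actual, pred) mean(abs(actual - pred))

actual <- c(500000, 620000, 710000)
pred   <- c(520000, 600000, 700000)

rmse_manual(actual, pred)  # penalizes large errors more heavily
mae_manual(actual, pred)   # approx. 16666.67
```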

Code
model_predictions <- list(
  MLR = model_eval$pred_mlr,
  RF = model_eval$pred_rf,
  gwRF = model_eval$pred_grf
)

metrics <- lapply(model_predictions, function(pred) {
  c(RMSE = rmse(model_eval$resale_price, pred),
    MAE = mae(model_eval$resale_price, pred))
})

metrics_df <- as.data.frame(do.call(rbind, metrics))

kable(metrics_df) %>%
  kable_styling(font_size = 18)
          RMSE       MAE
MLR   96025.16  73946.48
RF    70811.57  54771.09
gwRF  72055.22  55628.60
Note

Observations

Based on the RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) values, the Random Forest (RF) model performs best for predicting 5-room resale flat prices in the test set (July - September 2024):

  • RF has the lowest RMSE and MAE, showing higher accuracy and lower prediction error than both MLR and gwRF.
  • Geographically Weighted RF (gwRF) has slightly higher RMSE and MAE, making it the second-best model.
  • MLR has the highest RMSE and MAE, performing worst among the three.

This result is somewhat unexpected, as gwRF was expected to perform better given its additional geographical information.

To investigate further, we identify the best-performing model for each town, focusing on RMSE as our primary evaluation metric due to its sensitivity to larger errors, which matters when assessing accuracy in predicting resale prices.

To do so, we create a dataframe metrics_town_df that calculates the RMSE and MAE of each model (MLR, RF, and gwRF) for each town, then identifies the model with the lowest RMSE per town and records it in the best_model_RMSE column.

Code
metrics_town_df <- model_eval %>%
  group_by(town) %>%
  summarise(
    RMSE_MLR = rmse(resale_price, pred_mlr),
    MAE_MLR = mae(resale_price, pred_mlr),
    RMSE_RF = rmse(resale_price, pred_rf),
    MAE_RF = mae(resale_price, pred_rf),
    RMSE_gwRF = rmse(resale_price, pred_grf),
    MAE_gwRF = mae(resale_price, pred_grf)
  ) %>%
  rowwise() %>%
  mutate(
    best_model_RMSE = case_when(
      RMSE_MLR <= RMSE_RF & RMSE_MLR <= RMSE_gwRF ~ "MLR",
      RMSE_RF <= RMSE_MLR & RMSE_RF <= RMSE_gwRF ~ "RF",
      TRUE ~ "gwRF"
    )
  ) %>%
  ungroup()
table(metrics_town_df$best_model_RMSE)

gwRF  MLR   RF 
  10    3   13 

From the table above, we can see that RF is the preferred model for 13 towns, with gwRF a close second at 10.

To visualize the results on a map:

Code
metrics_town_df <- metrics_town_df %>%
  st_drop_geometry() %>%
  mutate(town = ifelse(town == "KALLANG/WHAMPOA", "KALLANG", town)) %>% # to match with mpsz
  mutate(town = ifelse(town == "CENTRAL AREA", "DOWNTOWN CORE", town)) # to match with mpsz

tmap_mode("view")
tmap_options(check.and.fix = TRUE)

best_model_by_town <- mpsz_sf %>%
  left_join(metrics_town_df, by = c("PLN_AREA_N" = "town"))

tm_shape(best_model_by_town) +
  tm_polygons("best_model_RMSE", alpha = 1, title = "Best Model (RMSE)") +
  tm_layout(
    outer.margins = c(0.1, 0, 0, 0),
    main.title = "Best Model for Resale Price Prediction by Town",
    main.title.position = "center",
    main.title.size = 1.5,
    legend.position = c("left", "bottom"),
    legend.title.size = 1.2,
    legend.text.size = 0.8,
    frame = TRUE,
    inner.margins = c(0.05, 0.05, 0.05, 0.05)
  )
Note

Observations

The map visualizes the best-performing model for predicting resale prices by town, based on RMSE. The distribution of models shows that:

  • The Random Forest (RF) model (in purple) is the best fit for many northern and western regions.
  • The Geographically Weighted Random Forest (gwRF) model (in green) performs best in central and southeastern areas.
  • Interestingly, the Multiple Linear Regression (MLR) model (in yellow) only outperforms in the Hougang area.

18.4 Variable Importance

Another way to understand these models is to study their variable importance. Variable importance quantifies how much a specific variable helps in making accurate predictions of the target variable (dependent variable). Higher importance means that the variable has a stronger influence on the model’s predictions.

To do so, we extract the importance data from the models and visualize them.

For the MLR model, we can use varImp() of the caret package to retrieve the importance of its variables.

mlr_vi_df <- varImp(model_mlr, scale = FALSE) %>%
  rownames_to_column(var = "variable") %>%
  rename(importance = Overall) %>%
  arrange(desc(importance))

For the RF and gwRF models, we can extract the variable.importance data directly. We then sort the data by importance in the same format as above.

rf_vi_df <- data.frame(importance = model_rf$variable.importance) %>%
  rownames_to_column(var = "variable") %>%
  arrange(desc(importance))
gwrf_vi_df <- data.frame(importance = model_gwrf$Global.Model$variable.importance) %>%
  rownames_to_column(var = "variable") %>%
  arrange(desc(importance))

To plot the variable importance of each model, we will use a custom function, plot_variable_importance, built with the plotly library:

Code
plot_variable_importance <- function(data, title = "Variable Importance", fill_color = "steelblue") {
  plot_ly(
    data,
    x = ~importance,
    y = ~reorder(variable, importance),
    type = "bar",
    orientation = "h",
    marker = list(color = fill_color)
  ) %>%
    layout(
      title = title,
      xaxis = list(title = "Importance"),
      yaxis = list(title = "Variable")
    )
}
Code
plot_variable_importance(mlr_vi_df, "[MLR] Variable Importance", "slateblue")
Code
plot_variable_importance(rf_vi_df, "[RF] Variable Importance", "orange")
Code
plot_variable_importance(gwrf_vi_df, "[gwRF] Variable Importance", "seagreen")
Note

Comparison of Variable Importance across Models

  • Across all three models (MLR, RF, gwRF), the top predictors are consistently PROX_CBD, remaining_lease, storey_order, and floor_area_sqm, indicating that these factors play crucial roles in predicting resale prices.

  • In MLR, PROX_CBD and remaining_lease have a more pronounced lead in importance over other variables compared to RF and gwRF.

  • To my surprise, PROX_LIB and PROX_HAWKER have higher variable importance in RF-based models than commonly expected variables like PROX_TRAIN (proximity to train stations) or PROX_MALLS (proximity to malls).

  • Variables like mup and hip exhibit very low importance across all models, indicating limited predictive power. This is expected: as observed in the EDA, a significant portion of resale transactions involve recently MOP-ed (5-year-old) flats, whereas HIP/MUP typically applies to flats that are 30 years old or more.

18.5 Visualising Errors

Given the similarity in Variable Importance for the RF-based models (RF and GWRF), we can gain further insights by examining the spatial distribution of prediction errors. This interactive map will help us identify areas where each model performs well or struggles, highlighting any spatial patterns in the errors.

Using the model_eval dataframe, we calculate the error for each model by subtracting the predicted resale prices from the actual resale prices.

The errors for RF and gwRF models are computed as follows:

model_eval <- model_eval %>%
 mutate(
   error_rf = resale_price - pred_rf,
   error_grf = resale_price - pred_grf
 )

To create a side-by-side box plot of RF and gwRF errors:

plot_ly(model_eval, y = ~error_rf, type = "box", name = "RF Error") %>%
  add_trace(y = ~error_grf, type = "box", name = "gwRF Error") %>%
  layout(
    title = "Comparison of RF and gwRF Errors",
    yaxis = list(title = "Error"),
    boxmode = "group"  # Places the boxes side by side
  )
Note

Observations

The box plot shows that the RF model has a tighter error distribution around zero, with fewer extreme outliers, indicating more consistent predictions.

In comparison, the gwRF model has a wider spread with more outliers, suggesting higher variability and larger errors in certain cases.

Overall, RF appears to provide more stable predictions.

Next, we find and label the extreme error rows for each model, so that we can examine them on the interactive map.

error_extremes_df <- bind_rows(
  model_eval %>% filter(error_rf == min(error_rf, na.rm = TRUE)) %>% mutate(error_type = "min_rf"),
  model_eval %>% filter(error_rf == max(error_rf, na.rm = TRUE)) %>% mutate(error_type = "max_rf"),
  model_eval %>% filter(error_grf == min(error_grf, na.rm = TRUE)) %>% mutate(error_type = "min_grf"),
  model_eval %>% filter(error_grf == max(error_grf, na.rm = TRUE)) %>% mutate(error_type = "max_grf")
) %>%
  select(error_type, everything())


error_extremes_df
Simple feature collection with 4 features and 34 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 20857.34 ymin: 29478.58 xmax: 29217.29 ymax: 38519.5
Projected CRS: SVY21 / Singapore TM
# A tibble: 4 × 35
  error_type month   town    flat_type floor_area_sqm flat_model remaining_lease
  <chr>      <chr>   <chr>   <chr>              <dbl> <chr>                <dbl>
1 min_rf     2024-07 CLEMEN… 5 ROOM               123 Improved               714
2 max_rf     2024-09 BUKIT … 5 ROOM               113 Improved              1050
3 min_grf    2024-07 CLEMEN… 5 ROOM               123 Improved               714
4 max_grf    2024-09 ANG MO… 5 ROOM               121 Improved              1046
# ℹ 28 more variables: resale_price <dbl>, address <chr>, geometry <POINT [m]>,
#   storey_order <int>, age <dbl>, hip <dbl>, mup <dbl>, PROX_CBD <dbl>,
#   PROX_CC <dbl>, PROX_LIB <dbl>, PROX_SPORTS <dbl>, PROX_ELDERCARE <dbl>,
#   PROX_PARK <dbl>, PROX_HAWKER <dbl>, PROX_TRAIN <dbl>, PROX_SMKT <dbl>,
#   PROX_GD_SCH <dbl>, PROX_MALLS <dbl>, NUM_KINDERGARTEN_350M <dbl>,
#   NUM_CHILDCARE_350M <dbl>, NUM_BUS_STOP_350M <dbl>, NUM_PRI_SCH_1KM <dbl>,
#   NUM_CLINIC_350M <dbl>, pred_mlr <dbl>, pred_rf <dbl>, pred_grf <dbl>, …
Note

Observations

It is interesting to observe that the minimum errors for both the RF and gwRF models occur at the same flat in Clementi.

To create the interactive point symbol map:

Code
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
error_rf <- tm_shape(mpsz_sf)+
  tm_polygons(alpha = 0.1) +
tm_shape(model_eval) +  
  tm_dots(col = "error_rf",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))

error_grf<- tm_shape(mpsz_sf)+
  tm_polygons(alpha = 0.1) +
tm_shape(model_eval) +  
  tm_dots(col = "error_grf",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))

tmap_arrange(error_rf, error_grf, 
             asp=1, ncol=2,
             sync = TRUE)
Note

Observations

Both maps show similar spatial error patterns for the RF and gwRF models. Slight differences in the error magnitudes can be observed, particularly with gwRF showing slightly higher error values in certain areas (Clementi North), as seen in the expanded scale on the right map.

19 Conclusion

In conclusion, this exercise aimed to predict resale prices of 5-room HDB flats in Singapore using various geospatial and aspatial features, ultimately comparing three models: Multiple Linear Regression (MLR), Random Forest (RF), and Geographically Weighted Random Forest (gwRF).

The data acquisition process proved to be a challenging and eye-opening experience. Combining data from multiple web sources and government portals required navigating numerous steps and technical hurdles, from dealing with changing data structures on official portals to geocoding addresses accurately. This process highlighted the importance of knowing where to look and how to handle different data sources, reinforcing the critical role of data sourcing skills in real-world projects.

In the feature engineering phase, we learned to manipulate geospatial data to create relevant variables, such as proximity-based and radius-based features. These features, including proximity to the Central Business District (CBD), schools, parks, and other amenities, added valuable locational context to the dataset. The EDA phase revealed general trends and provided a deeper understanding of the dataset. We also used multicollinearity checks to select variables, ensuring that the model wasn’t impacted by redundant predictors.

The model-building process was highly iterative. Initially, we included all variables, then gradually refined the models by removing statistically insignificant ones. For a fair comparison, all models use the same list of variables.

The model evaluation phase involved multiple steps, including prediction visualization and quantitative metrics such as RMSE and MAE. We further assessed performance at the town level to gain insights into model behavior across different regions, which highlighted the strengths and weaknesses of each model. Additionally, we examined variable importance (VI) to understand which factors were most influential in predicting resale prices. Across all models, the top three most important variables were PROX_CBD (proximity to CBD), remaining_lease (years left on the lease), and storey_order (floor level).

  • PROX_CBD: Proximity to Singapore’s CBD emerged as one of the strongest predictors, reflecting the premium associated with central locations and accessibility to economic and lifestyle hubs. Flats closer to the CBD generally command higher resale values due to their desirability and convenience.
  • remaining_lease: This factor represents the number of years left on the flat’s 99-year lease. Properties with more remaining lease years tend to have higher resale prices, as buyers value the longevity of their investment.
  • storey_order: Floor level also emerged as a key predictor, with higher floors generally commanding a premium, likely due to better views, reduced noise, and a greater sense of privacy. This trend aligns with general buyer preferences for high-floor units in densely populated urban environments.

Among the three models, RF surprisingly performed best, delivering the highest accuracy and lowest prediction error, even outperforming the spatially aware gwRF model. While gwRF was expected to excel due to its ability to account for spatial heterogeneity, RF’s performance suggests that the influence of geographic variations might be less pronounced for 5-room flat resale prices within the selected timeframe. Further investigation at the town level showed that RF performed consistently well across most towns, while gwRF was a close second, particularly excelling in central and southeastern areas. For the final comparison, we used an interactive map to visualize prediction errors, highlighting areas where RF and gwRF performed well or struggled, revealing spatial trends and unique town characteristics affecting model accuracy.

Overall, the use of RF and gwRF models demonstrated the power of machine learning in capturing complex relationships to predict resale prices accurately. For aspiring homeowners, real estate analysts, and policymakers, these insights offer a valuable perspective on the factors driving HDB resale prices across Singapore, setting the stage for future work in geographically weighted modeling and housing market analysis.

20 References